Rachel the Robo Caller - Modeling

At DEF CON 22, the FTC ran a contest to help mitigate robocalls. There were three rounds, the last of which was using a set of call records collected from a robocall honeypot to determine if a caller was a robocaller. See Parts I and II of the contest for details on robocaller honeypots.

The FTC gave us two sets of data, that show a phone call from one "person" to another along with the date and time. Both collections have been randomized uniquely, but the portions of area code and subscriber number were kept the same.

This Notebook is a follow up to Analyzing Rachel the Robo Caller and details creating a Random Forest classifier to predict robocallers.



In [31]:

    
from IPython.display import Image
Image("http://www.ftc.gov/system/files/attachments/zapping-rachel/zapping-rachel-contest.jpg")









    Out[31]:

Initial setup



In [6]:

    
%matplotlib inline
# Standard toolkits in pydata land
import pandas as pd
import numpy as np

# Exploring the use of a RandomForest
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier



In [16]:

    
def read_FTC(dataset):
    '''Reads the csv format that the FTC provided for the Rachel the Robocaller contest into a pandas DataFrame'''
    return pd.read_csv(dataset,
                parse_dates=["DATE/TIME"],
                converters={'LIKELY ROBOCALL': lambda val: val == 'X'},
                dtype={'TO': str, 'FROM': str, 'LIKELY ROBOCALL': bool}
    )



In [34]:

    
def extract_features(ftc_row):
    ftc_row["HOUR"] = ftc_row["DATE/TIME"].hour
    ftc_row["MINUTE"] = ftc_row["DATE/TIME"].minute
    
    # Extract the area code using slicing since they are all regular US numbers
    ftc_row["TO_AREA_CODE"] = ftc_row["TO"][1:4]
    ftc_row["FROM_AREA_CODE"] = ftc_row["FROM"][1:4]
    
    # Extract area code + "office code"
    ftc_row["TO_OFFICE_CODE"] = ftc_row["TO"][1:7]
    ftc_row["FROM_OFFICE_CODE"] = ftc_row["FROM"][1:7]
    
    dt = ftc_row["DATE/TIME"]
        
    ftc_row["TIMECHUNK"] = dt.hour + np.floor(4*(dt.minute/60.0))/4
    
    return ftc_row



In [35]:

    
def total_call_volume(df, direction="FROM"):
    sizes = df.groupby(direction).size()

    def get_size(val):
        return sizes[val]

    return df[direction].apply(get_size)



In [36]:

    
def massage_ftc_dataframe(ftc_dataframe):
    massaged = ftc_dataframe.apply(extract_features, axis=1)
    
    massaged["NUM_FROM_CALLS"] = total_call_volume(massaged, "FROM")
    massaged["NUM_TO_CALLS"] = total_call_volume(massaged, "TO")
    
    return massaged



In [24]:

    
labeled_data = read_FTC("FTC-DEFCON Data Set 1.csv")
unlabeled_data = read_FTC("FTC-DEFCON Data Set 2.csv")



In [41]:

    
# This assumes you have the data locally
massaged_labeled_data = massage_ftc_dataframe(labeled_data)
massaged_labeled_data.head()









    Out[41]:






  
    
      
      TO
      FROM
      DATE/TIME
      LIKELY ROBOCALL
      HOUR
      MINUTE
      TO_AREA_CODE
      FROM_AREA_CODE
      TO_OFFICE_CODE
      FROM_OFFICE_CODE
      TIMECHUNK
      NUM_FROM_CALLS
      NUM_TO_CALLS
    
  
  
    
      0
       17866291260
       13055793696
      2014-04-01
       False
       0
       0
       786
       305
       786629
       305579
       0
       72
       70
    
    
      1
       14027826713
       12063339487
      2014-04-01
        True
       0
       0
       402
       206
       402782
       206333
       0
       55
        6
    
    
      2
       17083187970
       12246108402
      2014-04-01
       False
       0
       0
       708
       224
       708318
       224610
       0
       22
       28
    
    
      3
       17733095581
       13035009570
      2014-04-01
        True
       0
       0
       773
       303
       773309
       303500
       0
       22
       11
    
    
      4
       19188765408
       16153878533
      2014-04-01
        True
       0
       0
       918
       615
       918876
       615387
       0
       33
        2



In [42]:

    
massaged_unlabeled_data = massage_ftc_dataframe(unlabeled_data)
massaged_unlabeled_data.head()









    Out[42]:






  
    
      
      TO
      FROM
      DATE/TIME
      LIKELY ROBOCALL
      HOUR
      MINUTE
      TO_AREA_CODE
      FROM_AREA_CODE
      TO_OFFICE_CODE
      FROM_OFFICE_CODE
      TIMECHUNK
      NUM_FROM_CALLS
      NUM_TO_CALLS
    
  
  
    
      0
       16163847430
       13236069958
      2014-06-01
       False
       0
       0
       616
       323
       616384
       323606
       0
        7
       11
    
    
      1
       12025176283
       12029867020
      2014-06-01
       False
       0
       0
       202
       202
       202517
       202986
       0
       13
       48
    
    
      2
       18663049187
       15159256650
      2014-06-01
       False
       0
       0
       866
       515
       866304
       515925
       0
        1
        1
    
    
      3
       15594157085
       16199247140
      2014-06-01
       False
       0
       0
       559
       619
       559415
       619924
       0
       47
       10
    
    
      4
       18582407865
       19492012595
      2014-06-01
       False
       0
       0
       858
       949
       858240
       949201
       0
        1
       34



In [43]:

    
massaged_unlabeled_data.tail()









    Out[43]:






  
    
      
      TO
      FROM
      DATE/TIME
      LIKELY ROBOCALL
      HOUR
      MINUTE
      TO_AREA_CODE
      FROM_AREA_CODE
      TO_OFFICE_CODE
      FROM_OFFICE_CODE
      TIMECHUNK
      NUM_FROM_CALLS
      NUM_TO_CALLS
    
  
  
    
      201515
       14435522376
       14436426683
      2014-06-06 23:59:00
       False
       23
       59
       443
       443
       443552
       443642
       23.75
        66
        98
    
    
      201516
       17325876492
       15169325497
      2014-06-06 23:59:00
       False
       23
       59
       732
       516
       732587
       516932
       23.75
       425
         9
    
    
      201517
       14159683941
       12241676708
      2014-06-06 23:59:00
       False
       23
       59
       415
       224
       415968
       224167
       23.75
        28
       193
    
    
      201518
       16204321022
       17853507101
      2014-06-06 23:59:00
       False
       23
       59
       620
       785
       620432
       785350
       23.75
        16
        18
    
    
      201519
       13475865534
       15708781910
      2014-06-06 23:59:00
       False
       23
       59
       347
       570
       347586
       570878
       23.75
         2
         2



In [44]:

    
massaged_labeled_data.columns









    Out[44]:





Index([u'TO', u'FROM', u'DATE/TIME', u'LIKELY ROBOCALL', u'HOUR', u'MINUTE', u'TO_AREA_CODE', u'FROM_AREA_CODE', u'TO_OFFICE_CODE', u'FROM_OFFICE_CODE', u'TIMECHUNK', u'NUM_FROM_CALLS', u'NUM_TO_CALLS'], dtype='object')

Now to build our Random Forest and see how it fares



In [53]:

    
# Scoring system for contest
# Not 0-1 loss...
def score(our_predictions, true_results):
    '''Scoring system for the FTC contest. Not 0-1 loss.'''
    our_score = 0
    for i in range(len(true_results)):
        if (our_predictions[i] == True and true_results[i] == True):
            our_score += 1
        if (our_predictions[i] == True and true_results[i] == False):
            our_score -= 1
    return our_score


# features is only a copy of the dataframe, can't use this
#def label_encode(features, feature_name):
#    feature_encoder = preprocessing.LabelEncoder()
#    features[feature_name] = feature_encoder.fit_transform(features[feature_name])
#    return feature_encoder

def enriched_data_to_features(enriched_data):
    '''Takes a pandas DataFrame with enriched FTC data, returns features and target labels.'''
    categorical_feature_names = [

            "TO_AREA_CODE",
            "FROM_AREA_CODE",
            "TO_OFFICE_CODE",
            "FROM_OFFICE_CODE",

            #"TOTZ",
            #"FROMTZ",
            #"SAMEAREACODE",
            #"WITHIN_THREE_MINUTES",
            #"FROMVALID",
            "TIMECHUNK",
            #"ISWEEKDAY", # Undecided on whether this will generalize since
                          # training and test data have different weekdays
                          # And the labeled data is missing Mondays
    ]
    
    numerical_feature_names = ["NUM_FROM_CALLS", "NUM_TO_CALLS"]
    
    feature_names = categorical_feature_names + numerical_feature_names

    features = enriched_data[feature_names]
    
    for feature_name in categorical_feature_names:
        print("Creating categorical feature {}".format(feature_name))
        encoder = preprocessing.LabelEncoder()
        features[feature_name] = encoder.fit_transform(features[feature_name])
    
    target = enriched_data["LIKELY ROBOCALL"].values
    
    return features, target
    

def train(features, target, min_samples_split=285):
    classifier = RandomForestClassifier(n_estimators=200, 
                                        verbose=0,
                                        n_jobs=-1,
                                        min_samples_split=min_samples_split,
                                        random_state=1,
                                        oob_score=True)

    classifier.fit(features, target)
    print("Resulting OOB Score: {}".format(classifier.oob_score_))

    return classifier



In [48]:

    
# Separate into training and test sets based on FROM
# This won't be needed when reading in testing data set;
# for that, train on full data and then use .predict()

from_numbers = massaged_labeled_data["FROM"].unique()

# 70% / 30%
num_train = int(round(.7 * len(from_numbers)))
num_test = len(from_numbers) - num_train
train_samples = np.random.choice(from_numbers, num_train)


train_data = massaged_labeled_data[massaged_labeled_data['FROM'].isin(train_samples)]
test_data = massaged_labeled_data[~massaged_labeled_data['FROM'].isin(train_samples)]



In [58]:

    
# For development
print("Enriching Training Data")
train_features, train_target = enriched_data_to_features(train_data)
print("Enriching Testing Data")
test_features, test_target = enriched_data_to_features(test_data)

min_samples_split_values = np.arange(150, 290, 5)

num_parameter_trials = len(min_samples_split_values)

# create dataframe
score_frame = pd.DataFrame(index=np.arange(0, num_parameter_trials), columns=('min_samples_split', 'test_score', 'train_score') )

for trial in np.arange(0, num_parameter_trials):
    
    c = min_samples_split_values[trial]
    classifier = train(train_features, train_target, c)
    our_predictions = classifier.predict(test_features)
    our_train_predictions = classifier.predict(train_features)
    
    score_frame.loc[trial] = [c, score(our_predictions, test_target), score(our_train_predictions, train_target) ]









    



Enriching Training Data
Creating categorical feature TO_AREA_CODE
Creating categorical feature FROM_AREA_CODE
Creating categorical feature TO_OFFICE_CODE
Creating categorical feature FROM_OFFICE_CODE
Creating categorical feature TIMECHUNK
Enriching Testing Data
Creating categorical feature TO_AREA_CODE
Creating categorical feature FROM_AREA_CODE
Creating categorical feature TO_OFFICE_CODE
Creating categorical feature FROM_OFFICE_CODE
Creating categorical feature TIMECHUNK
Resulting OOB Score: 0.925460738398
Resulting OOB Score: 0.924730855787
Resulting OOB Score: 0.925460738398
Resulting OOB Score: 0.92317985524
Resulting OOB Score: 0.923605620096
Resulting OOB Score: 0.921400766377
Resulting OOB Score: 0.919925795268
Resulting OOB Score: 0.919500030412
Resulting OOB Score: 0.919925795268
Resulting OOB Score: 0.918967824342
Resulting OOB Score: 0.917386412019
Resulting OOB Score: 0.917720941549
Resulting OOB Score: 0.916413235205
Resulting OOB Score: 0.914938264096
Resulting OOB Score: 0.915440058391
Resulting OOB Score: 0.915151146524
Resulting OOB Score: 0.913235204671
Resulting OOB Score: 0.913797822517
Resulting OOB Score: 0.911334468706
Resulting OOB Score: 0.911516939359
Resulting OOB Score: 0.911668998236
Resulting OOB Score: 0.911349674594
Resulting OOB Score: 0.91071102731
Resulting OOB Score: 0.90897755611
Resulting OOB Score: 0.908947144334
Resulting OOB Score: 0.906924761268
Resulting OOB Score: 0.90604281978
Resulting OOB Score: 0.906894349492



In [59]:

    
score_frame









    Out[59]:






  
    
      
      min_samples_split
      test_score
      train_score
    
  
  
    
      0 
       150
       4578
       17775
    
    
      1 
       155
       4734
       17659
    
    
      2 
       160
       4651
       17687
    
    
      3 
       165
       4691
       17573
    
    
      4 
       170
       4693
       17554
    
    
      5 
       175
       4671
       17403
    
    
      6 
       180
       4666
       17343
    
    
      7 
       185
       4711
       17326
    
    
      8 
       190
       4635
       17274
    
    
      9 
       195
       4828
       17244
    
    
      10
       200
       4626
       17143
    
    
      11
       205
       4689
       17149
    
    
      12
       210
       4929
       17074
    
    
      13
       215
       4688
       16988
    
    
      14
       220
       4686
       16975
    
    
      15
       225
       4855
       16924
    
    
      16
       230
       4641
       16907
    
    
      17
       235
       4753
       16885
    
    
      18
       240
       4661
       16701
    
    
      19
       245
       4731
       16745
    
    
      20
       250
       4778
       16702
    
    
      21
       255
       4776
       16614
    
    
      22
       260
       4918
       16619
    
    
      23
       265
       4897
       16552
    
    
      24
       270
       4982
       16482
    
    
      25
       275
       5000
       16370
    
    
      26
       280
       4798
       16289
    
    
      27
       285
       4816
       16360



In [64]:

    
# For the sake of the contest now, we'll train on the entire FTC1 dataset and then fit to the FTC2 dataset

train_data = massaged_labeled_data
test_data = massaged_unlabeled_data

train_features, train_target = enriched_data_to_features(train_data)

test_features, _ = enriched_data_to_features(test_data)

c = 285 # Determined during contest, not sure of now that I've run through it again
classifier = train(train_features, train_target, c)
predictions = classifier.predict(test_features)

contest_results = unlabeled_data[["FROM", "TO", "DATE/TIME"]]

contest_results["LIKELY ROBOCALL"] = predictions
contest_results["LIKELY ROBOCALL"] = X["LIKELY ROBOCALL"].map(lambda x: "X" if x else "")
contest_results.to_csv("predictions.csv", index=False)









    



Creating categorical feature TO_AREA_CODE
Creating categorical feature FROM_AREA_CODE
Creating categorical feature TO_OFFICE_CODE
Creating categorical feature FROM_OFFICE_CODE
Creating categorical feature TIMECHUNK
Creating categorical feature TO_AREA_CODE
Creating categorical feature FROM_AREA_CODE
Creating categorical feature TO_OFFICE_CODE
Creating categorical feature FROM_OFFICE_CODE
Creating categorical feature TIMECHUNK
Resulting OOB Score: 0.910907595204



In [65]:

    
!ls









    



Analyzing Rachel the Robo Caller.ipynb	Modeling Rachel the Robo Caller.ipynb
Dockerfile				predictions.csv
enrich.ipynb				rachel.py
FTC-DEFCON Data Set 1.csv		README.md
FTC-DEFCON Data Set 2.csv		requirements.txt
LICENSE					Untitled0.ipynb



In [ ]:

	TO	FROM	DATE/TIME	LIKELY ROBOCALL	TO_AREA_CODE	FROM_AREA_CODE	TO_OFFICE_CODE	FROM_OFFICE_CODE	NUM_FROM_CALLS	NUM_TO_CALLS
0	17866291260	13055793696	2014-04-01	False	786	305	786629	305579	72	70
1	14027826713	12063339487	2014-04-01	True	402	206	402782	206333	55	6
2	17083187970	12246108402	2014-04-01	False	708	224	708318	224610	22	28
3	17733095581	13035009570	2014-04-01	True	773	303	773309	303500	22	11
4	19188765408	16153878533	2014-04-01	True	918	615	918876	615387	33	2

	TO	FROM	DATE/TIME	LIKELY ROBOCALL	TO_AREA_CODE	FROM_AREA_CODE	TO_OFFICE_CODE	FROM_OFFICE_CODE	NUM_FROM_CALLS	NUM_TO_CALLS
0	16163847430	13236069958	2014-06-01	False	616	323	616384	323606	7	11
1	12025176283	12029867020	2014-06-01	False	202	202	202517	202986	13	48
2	18663049187	15159256650	2014-06-01	False	866	515	866304	515925	1	1
3	15594157085	16199247140	2014-06-01	False	559	619	559415	619924	47	10
4	18582407865	19492012595	2014-06-01	False	858	949	858240	949201	1	34

	TO	FROM	DATE/TIME	LIKELY ROBOCALL	HOUR	MINUTE	TO_AREA_CODE	FROM_AREA_CODE	TO_OFFICE_CODE	FROM_OFFICE_CODE	TIMECHUNK	NUM_FROM_CALLS	NUM_TO_CALLS
201515	14435522376	14436426683	2014-06-06 23:59:00	False	23	59	443	443	443552	443642	23.75	66	98
201516	17325876492	15169325497	2014-06-06 23:59:00	False	23	59	732	516	732587	516932	23.75	425	9
201517	14159683941	12241676708	2014-06-06 23:59:00	False	23	59	415	224	415968	224167	23.75	28	193
201518	16204321022	17853507101	2014-06-06 23:59:00	False	23	59	620	785	620432	785350	23.75	16	18
201519	13475865534	15708781910	2014-06-06 23:59:00	False	23	59	347	570	347586	570878	23.75	2	2

	min_samples_split	test_score	train_score
0	150	4578	17775
1	155	4734	17659
2	160	4651	17687
3	165	4691	17573
4	170	4693	17554
5	175	4671	17403
6	180	4666	17343
7	185	4711	17326
8	190	4635	17274
9	195	4828	17244
10	200	4626	17143
11	205	4689	17149
12	210	4929	17074
13	215	4688	16988
14	220	4686	16975
15	225	4855	16924
16	230	4641	16907
17	235	4753	16885
18	240	4661	16701
19	245	4731	16745
20	250	4778	16702
21	255	4776	16614
22	260	4918	16619
23	265	4897	16552
24	270	4982	16482
25	275	5000	16370
26	280	4798	16289
27	285	4816	16360